purrr
Emorie D Beck
February 21, 2020
Iteration is everywhere. It underpins much of mathematics and statistics. If you’ve ever seen the \(\Sigma\) symbol, then you’ve seen (and probably used) iteration.
Reasons for iteration:
- reading in multiple files from a directory
- running the same operation multiple times
- running different combinations of the same model
- creating similar figures / tables / outputs
for loops
Enter for loops. for loops are the “OG” form of iteration in computer science. The basic syntax is below. We can use a for loop to step through a series of things and print each one.
## [1] "a"
## [1] "b"
## [1] "c"
## [1] "d"
## [1] "e"
The code above “loops” through 5 times, printing the letter for each iteration.
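A minimal sketch of a loop that would produce output like the above, assuming the values come from R's built-in letters vector (the loop variable name i is illustrative):

```r
# Loop over the first five lowercase letters and print each one in turn
for (i in letters[1:5]) {
  print(i)
}
```

After the loop finishes, i still holds the last value it took, which is one way for loops differ from the functional approaches that follow.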
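_apply() family
A more compact alternative to for loops comes from the _apply() family of functions, including apply(), lapply(), sapply(), and mapply(). Unlike for loops, these take a function, apply it to each element, and return the collected results. They are often described as vectorized, but they still iterate internally, so any speed gain over a well-written for loop is usually modest; the main benefit is shorter, less error-prone code.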
## [1] "a"
## [1] "b"
## [1] "c"
## [1] "d"
## [1] "e"
## [[1]]
## [1] "a"
##
## [[2]]
## [1] "b"
##
## [[3]]
## [1] "c"
##
## [[4]]
## [1] "d"
##
## [[5]]
## [1] "e"
## [1] "a"
## [1] "b"
## [1] "c"
## [1] "d"
## [1] "e"
## a b c d e
## "a" "b" "c" "d" "e"
## [1] "a"
## [1] "b"
## [1] "c"
## [1] "d"
## [1] "e"
## a b c d e
## "a" "b" "c" "d" "e"
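As a rough sketch of how three members of the family differ in what they return, the example below uses toupper() as a stand-in function (an illustrative choice, not the function used to produce the output above):

```r
x <- letters[1:5]

# lapply() always returns a list, one element per input
res_list <- lapply(x, toupper)

# sapply() simplifies the result when it can, here to a named character vector
res_vec <- sapply(x, toupper)

# mapply() iterates over several inputs in parallel
res_multi <- mapply(function(letter, number) paste(letter, number), x, 1:5)
```

Because the input is an unnamed character vector, sapply() and mapply() use the input values as names, which is why the printed output shows a header row of letters.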
purrr and _map_() functions
Today, though, we’ll focus on the map() family of functions, which are the functions purrr uses to iterate.
## [1] "a"
## [1] "b"
## [1] "c"
## [1] "d"
## [1] "e"
## [[1]]
## [1] "a"
##
## [[2]]
## [1] "b"
##
## [[3]]
## [1] "c"
##
## [[4]]
## [1] "d"
##
## [[5]]
## [1] "e"
For a more thorough comparison of for loops, the _apply() family, and _map_() functions, see https://jennybc.github.io/purrr-tutorial/
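A minimal sketch of a map() call, again using toupper() as an illustrative function. It shows both a plain function argument and purrr's formula shorthand, where . stands for the current element:

```r
library(purrr)

# map() applies a function to each element and always returns a list
res <- map(letters[1:5], toupper)

# The same call written with the formula shorthand
res_formula <- map(letters[1:5], ~ toupper(.))
```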
purrr and _map_() predicates
Note that map() returns a list, which we may not always want. With purrr, we can change the kind of output of map() by adding a predicate (a type suffix) such as _lgl, _dbl, _chr, or _df. So in the example above, we may have wanted just a character vector. To get that, we’d call map_chr():
## [1] "a"
## [1] "b"
## [1] "c"
## [1] "d"
## [1] "e"
## [1] "a" "b" "c" "d" "e"
Note that it returns the concatenated character vector in addition to printing each letter individually (i.e., iteratively).
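A short sketch of how the suffix controls the output type. The inputs and functions here are illustrative assumptions, not the ones from the slide:

```r
library(purrr)

chr <- map_chr(letters[1:5], toupper)  # character vector
dbl <- map_dbl(1:5, ~ . / 2)           # double vector
lgl <- map_lgl(1:5, ~ . > 2)           # logical vector
```

Each typed variant raises an error if the function returns something that does not fit the requested type, which makes these variants a cheap check on what a function actually returns.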
purrr and _map_() antecedents
- map_()
- map2_()
- pmap_()
## [1] "a 1" "b 2" "c 3" "d 4" "e 5"
Note here that we can use map2() and pmap() with the predicates from above.
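A sketch of how the antecedents combine with the type suffixes, assuming two or three parallel inputs of equal length (the third input and the argument names l, n, and u are hypothetical):

```r
library(purrr)

# map2_*() iterates over two inputs in parallel; .x and .y are the paired elements
two <- map2_chr(letters[1:5], 1:5, ~ paste(.x, .y))

# pmap_*() takes a list of any number of inputs, matched to arguments by position
three <- pmap_chr(
  list(letters[1:5], 1:5, LETTERS[1:5]),
  function(l, n, u) paste(l, n, u)
)
```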
There are a number of different cases where purrr and map() may be useful for reading in data.
For this first example, I’ll show you how this would look with a for loop before I show you how it looks with purrr.
Assuming you have all the data in a single folder and the format is reasonably similar, you have the following basic syntax:
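A minimal sketch of what such a loop could look like. To keep it self-contained, it first writes three small hypothetical files to a temporary folder and uses base R's read.csv() rather than read_csv(); the file names and columns are assumptions for illustration only:

```r
# Hypothetical setup: one small CSV per person in a temporary folder
data_dir <- file.path(tempdir(), "example_loop")
dir.create(data_dir, showWarnings = FALSE)
for (id in c("101", "102", "103")) {
  write.csv(data.frame(count = 1:2, x = c(1.5, 2.5)),
            file.path(data_dir, paste0(id, ".csv")), row.names = FALSE)
}

# Read every file, tag its rows with the person ID, and stack the results
files <- list.files(data_dir, pattern = "\\.csv$")
pieces <- vector("list", length(files))     # pre-allocate the output list
for (i in seq_along(files)) {
  d <- read.csv(file.path(data_dir, files[i]))
  d$ID <- sub("\\.csv$", "", files[i])      # strip the extension to get the ID
  pieces[[i]] <- d
}
df_loop <- do.call(rbind, pieces)
```

The bookkeeping (pre-allocating, indexing, binding) is the part that map() takes over in the purrr version.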
This works fine in this simple case, but where purrr really shines is when you need to make modifications to your data before combining, whether this be recoding, removing missing cases, or renaming variables.
But first, the simple case of reading data. The code below will download a .zip file when you run it. Once you do, navigate to your Desktop and unzip the folder. You should now be able to run the rest of the code.
library(tidyverse) # loads tibble, dplyr, tidyr, purrr, and readr, all used below
data_source <- "https://github.com/emoriebeck/R-tutorials/raw/master/wustl_r_workshops/purrr.zip"
data_dest <- "~/Desktop/purrr.zip"
download.file(data_source, data_dest)
data_path <- "~/Desktop/purrr"
df1 <- tibble(
  ID = list.files(sprintf("%s/data/example_1", data_path))
) %>%
  mutate(data = map(ID, ~read_csv(sprintf("%s/data/example_1/%s", data_path, .)))) %>%
  unnest(data)
The code above creates a list of IDs from the data path (files named for each person), reads the data in using the map() function from purrr, then unnests the data, resulting in a single data frame with rows for every person. As written, the ID column keeps the “.csv” extension; it can be stripped with, for example, str_remove() from stringr.
purrr
On the previous slide, we saw a data frame inside of a data frame. This is called a list column within a nested data frame.
In this case, we created a list column using map, but one of the best things about purrr is how it combines with the nest() and unnest() functions from the tidyr package.
We’ll return to nest() later to demonstrate how anything you would iterate across is also something we can nest() by in long format data frames.
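A small sketch of the nest() and unnest() round trip on a made-up data frame (the toy data, the score column, and the mean_score summary are all assumptions for illustration):

```r
library(dplyr)
library(tidyr)
library(purrr)

toy <- tibble(ID = rep(c("p1", "p2"), each = 3), score = c(1, 2, 3, 4, 5, 6))

nested <- toy %>%
  group_by(ID) %>%
  nest() %>%                                             # one row per ID; data is a list column
  mutate(mean_score = map_dbl(data, ~ mean(.$score))) %>% # iterate over the nested data frames
  ungroup()

restored <- nested %>%
  select(-mean_score) %>%
  unnest(data)                                           # back to one row per observation
```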
Now we’re going to combine this with what we learned last time about codebooks.
data_path <- "~/Desktop/purrr"
codebook <- sprintf("%s/data/codebook_ex1.csv", data_path) %>% read_csv(.)
codebook
Now that we have a codebook, what are the next steps?
df1 <- tibble(
  ID = list.files(sprintf("%s/data/example_1", data_path))
) %>%
  mutate(data = map(ID, ~read_csv(sprintf("%s/data/example_1/%s", data_path, .)))) %>%
  unnest(data)
df1
old.names <- codebook$old_name # pull old names in raw data from codebook
new.names <- codebook$new_name # pull new names from codebook
df1 <- tibble(
  ID = list.files(sprintf("%s/data/example_1", data_path))
) %>%
  mutate(data = map(ID, ~read_csv(sprintf("%s/data/example_1/%s", data_path, .)))) %>%
  unnest(data) %>%
  select(ID, count, one_of(old.names)) %>% # select columns from codebook in loaded data
  setNames(c("ID", "count", new.names)) # rename columns with new names
df1
With these kinds of data, the first thing we may want to do is look at within-person correlations, which we can do with purrr.
(nested.r <- df1 %>%
  group_by(ID) %>%
  nest() %>%
  mutate(r = map(data, ~cor((.) %>% select(-count), use = "pairwise"))))
We can access it like a list:
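A sketch of list-style access on a simulated stand-in for the nested correlations. The toy data and the column names E and A are hypothetical; with the real data, nested.r$r[[1]] would work the same way:

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1)
toy <- tibble(
  ID = rep(c("p1", "p2"), each = 10),
  count = rep(1:10, times = 2),
  E = rnorm(20),
  A = rnorm(20)
)

toy_r <- toy %>%
  group_by(ID) %>%
  nest() %>%
  mutate(r = map(data, ~ cor(select(., -count), use = "pairwise")))

first_r <- toy_r$r[[1]]   # correlation matrix for the first person
first_r["E", "A"]         # a single correlation from that matrix
```

Double brackets pull out the matrix itself, whereas single brackets (toy_r$r[1]) would return a list of length one.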
To run separate models for each trait, we’ll need to reshape our data.
To create composites, we’ll separate traits from items.
df.long <- df1 %>%
  gather(key = item, value = value, -count, -ID, -satisfaction, na.rm = TRUE) %>%
  separate(item, c("Trait", "item"), sep = "_")
df.long
To create composites, we’ll then group_by() trait, count, and ID and calculate the composites using summarize().
df.long <- df1 %>%
  gather(key = item, value = value, -count, -ID, -satisfaction, na.rm = TRUE) %>%
  separate(item, c("Trait", "item"), sep = "_") %>%
  group_by(ID, count, Trait) %>%
  summarize(value = mean(value))
df.long
Then we’ll get within-person centered values using scale().
df.long <- df1 %>%
  gather(key = item, value = value, -count, -ID, -satisfaction, na.rm = TRUE) %>%
  separate(item, c("Trait", "item"), sep = "_") %>%
  group_by(ID, count, Trait, satisfaction) %>%
  summarize(value = mean(value)) %>%
  group_by(ID, Trait) %>%
  mutate(value_c = as.numeric(scale(value, center = TRUE, scale = FALSE)))
df.long
And grand-mean centered within-person averages:
df.long <- df.long %>%
  group_by(ID, Trait) %>%
  summarize(value_gmc = mean(value)) %>%
  group_by(Trait) %>%
  mutate(value_gmc = as.numeric(scale(value_gmc, center = TRUE, scale = FALSE))) %>%
  full_join(df.long)
df.long
And now we are ready to run our models. But first, we’ll nest() our data.
And now run the models.
library(lme4) # provides lmer()
nested.mods <- df.long %>%
  group_by(Trait) %>%
  nest() %>%
  mutate(model = map(data, ~lmer(satisfaction ~ value_c * value_gmc + (1 | ID), data = .)))
nested.mods
And get data frames of the results:
library(broom.mixed) # provides tidy() methods for lmer() models
nested.mods <- df.long %>%
  group_by(Trait) %>%
  nest() %>%
  mutate(model = map(data, ~lmer(satisfaction ~ value_c * value_gmc + (1 | ID), data = .)),
         tidy = map(model, ~tidy(., conf.int = TRUE)))
nested.mods
Which we can print into pretty data frames
Which we can pretty easily turn into plots:
nested.mods %>%
  select(Trait, tidy) %>%
  unnest(tidy) %>%
  filter(term == "value_c") %>%
  ggplot(aes(x = Trait, y = estimate, ymin = conf.low, ymax = conf.high)) +
  geom_errorbar(position = "dodge", width = .1) +
  geom_point(aes(color = Trait), size = 2) +
  geom_hline(aes(yintercept = 0), linetype = "dashed") +
  labs(y = "Personality State-\nSatisfaction Association") +
  coord_flip() +
  theme_classic() +
  theme(legend.position = "none")
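For readers without the workshop data, the nest, model, and extract pattern can be sketched on R's built-in mtcars data. This swaps in a plain lm() for the multilevel lmer() models above and pulls the slope with coef() instead of tidy(), so it illustrates the workflow rather than reproducing the analysis:

```r
library(dplyr)
library(tidyr)
library(purrr)

toy_mods <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%                                              # one row per cylinder group
  mutate(
    model = map(data, ~ lm(mpg ~ wt, data = .)),          # fit one model per group
    slope = map_dbl(model, ~ coef(.)[["wt"]])             # pull the wt coefficient from each
  ) %>%
  ungroup()

toy_mods %>% select(cyl, slope)
```

Keeping the models in a list column means each fitted object stays on the same row as the data and group it came from.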